R Basic Concepts:
Subsetting

Jürgen Wilbert

University of Potsdam

11/17/22

Subsetting

Selecting elements of a data structure.

Selecting elements with square brackets

By providing a number within square brackets, the respective element is selected from a vector:

names <- c("Sheldon", "Leonard", "Penny", "Amy")
names[1]
[1] "Sheldon"

When you provide a vector of numbers, multiple elements are selected:

names[c(1,4)]
[1] "Sheldon" "Amy"    

You can even change the order or repeat elements:

names[c(4, 1, 1)]
[1] "Amy"     "Sheldon" "Sheldon"

With negative numbers, elements are dropped:

names[-1]
[1] "Leonard" "Penny"   "Amy"    
names[c(-1, -3)]
[1] "Leonard" "Amy"    

Task

Take the vector
names <- c("Sheldon", "Leonard", "Penny", "Amy")
and reorder it to get the following result:
[1] "Sheldon" "Amy" "Sheldon" "Amy" "Leonard" "Penny"

:-)

Task - solution

Take the vector
names <- c("Sheldon", "Leonard", "Penny", "Amy")
and reorder it to get the following result:
[1] "Sheldon" "Amy" "Sheldon" "Amy" "Leonard" "Penny"

x <- c(1, 4, 1, 4, 2, 3)
new_order <- names[x]
new_order
[1] "Sheldon" "Amy"     "Sheldon" "Amy"     "Leonard" "Penny"  

Subsetting data frames

Firstly, we create an example data frame:

study <- data.frame(
  sen    = c(0, 1, 0, 1, 0, 1),
  gender = c("M", "M", "F", "M", "F", "F"),
  age    = c(12, 13, 11, 10, 11, 14),
  IQ     = c(90, 85, 90, 87, 99, 89)
)
study
sen gender age IQ
0 M 12 90
1 M 13 85
0 F 11 90
1 M 10 87
0 F 11 99
1 F 14 89

Square brackets select a column of a data frame either by a number or by the column name:

study[3]
age
12
13
11
10
11
14
study["age"]
age
12
13
11
10
11
14

The subsetted object is a data frame with one column.
This is different from extracting a variable with $ or [[ signs:

study[["age"]]
[1] 12 13 11 10 11 14
study$age
[1] 12 13 11 10 11 14

which returns a vector (!)

While this works:

median(study[["age"]])
[1] 11.5

this throws an error:

median(study["age"])
Error in median.default(study["age"]) : need numeric data
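
The difference between the two forms can be checked directly. A minimal sketch, re-creating the study data frame from above so that the block runs on its own:

```r
study <- data.frame(
  sen    = c(0, 1, 0, 1, 0, 1),
  gender = c("M", "M", "F", "M", "F", "F"),
  age    = c(12, 13, 11, 10, 11, 14),
  IQ     = c(90, 85, 90, 87, 99, 89)
)

# Single brackets keep the data frame structure
is.data.frame(study["age"])           # TRUE

# Double brackets and $ extract the column itself (a numeric vector)
is.data.frame(study[["age"]])         # FALSE
is.numeric(study[["age"]])            # TRUE

# Both extraction forms return the same vector
identical(study[["age"]], study$age)  # TRUE
```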

Providing a vector will select multiple columns:

study[c(1,3)]
sen age
0 12
1 13
0 11
1 10
0 11
1 14
study[c("sen", "age")]
sen age
0 12
1 13
0 11
1 10
0 11
1 14

Extraction and subsetting

The extraction of a vector and the selection of elements can be combined:

age <- study[["age"]]
age[c(2,4)]
[1] 13 10

Or within one step:

study$age[c(2,4)]
[1] 13 10
study[["age"]][c(2,4)]
[1] 13 10

Selecting rows and columns

Specific cases are selected within square brackets: object_name[rows, columns].

study[5, ]  # filter a row
sen gender age IQ
5 0 F 11 99
study[c(2, 6), ] # filter two rows
sen gender age IQ
2 1 M 13 85
6 1 F 14 89
study[c(2, 6), "IQ"]
[1] 85 89
study[c(2, 6), c("sen", "IQ")]
sen IQ
2 1 85
6 1 89

You could also use numbers to address the columns:

study[, 2]
[1] "M" "M" "F" "M" "F" "F"
study[c(2, 6), c(1, 3)]
sen age
2 1 13
6 1 14

Task

Please create a new data frame (study2) comprising the gender and age variables for the cases 1, 3, and 5 of the study data frame.

:-)

Task - solution

Please create a new data frame (study2) comprising the gender and age variables for the cases 1, 3, and 5 of the study data frame.

study2 <- study[c(1, 3, 5), c("gender", "age")]
study2
gender age
1 M 12
3 F 11
5 F 11

Sophisticated subsetting

Subsetting becomes most powerful when it is combined with conditional selections.

For example:

  • Select all students with special educational needs.
  • Select all male students between the ages of 6 and 10.

To apply such selections, we have to know about relational and logical operators.

Relational operators

Relational operators compare two values and return a logical value (TRUE or FALSE).

Operator   Relation                      Example
==         is equal to                   x == y
!=         is not equal to               x != y
>          is greater than               x > y
>=         is greater than or equal to   x >= y
<          is less than                  x < y
<=         is less than or equal to      x <= y

Examples

7 > 2
[1] TRUE
7 <=  10
[1] TRUE
5 == 4
[1] FALSE
5 != 6
[1] TRUE

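Relational operators and characters

The operators == and != can also be applied to non-numeric objects such as character strings. (<, >, <= and >= work on strings as well, but they compare them in the sort order of the current locale.)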

"Hamster" == "Mouse"
[1] FALSE
"Hamster" != "Mouse"
[1] TRUE
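
The same comparison works element by element on a character vector. A small sketch (the gender vector is made up here for illustration and matches the coding used in the data frame examples):

```r
gender <- c("M", "M", "F", "M", "F", "F")

# Each element is compared with "F"; the result is a logical vector
gender == "F"

# != gives the complementary result
gender != "F"
```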

Relational operators and vectors

age <- c(12, 4, 3, 8, 4, 2, 1)
age < 5
[1] FALSE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE

This behavior relies on recycling, which is implemented in many (but not all!) R functions.

Recycling: The shorter operand (here the single value 5) is repeated to match the length of the longer vector. The operation is then applied to each element and a vector is returned.

age   age < 5
12    FALSE
4     TRUE
3     TRUE
8     FALSE
4     TRUE
2     TRUE
1     TRUE
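
Recycling also applies when the shorter vector has more than one element. A small sketch (the vectors a and b are made up for illustration):

```r
a <- c(1, 2, 3, 4)
b <- c(2, 3)

# b is repeated to c(2, 3, 2, 3) before the comparison is made
a > b

# ... which is the same as writing the repeated vector explicitly
a > c(2, 3, 2, 3)
```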

Using logical vectors to select values

When you put a logical vector within square brackets [ ] after an object, all elements of that object with a TRUE in the logical vector are selected:

age <- c(12, 4, 3, 8)
x <- age > 5
x
[1]  TRUE FALSE FALSE  TRUE
age[x]
[1] 12  8

Using logical vectors to select values

age <- c(12, 4, 3, 8)
x <- age > 5
age[x]
age   x <- age > 5   Select?   Result
12    TRUE           select    12
4     FALSE          drop
3     FALSE          drop
8     TRUE           select    8

Task

Create a new vector friends <- c(4, 5, 6, 3, 7, 2, 3).
Show all values of that vector >= 4.

:-)

Task - solution

Create a new vector friends <- c(4, 5, 6, 3, 7, 2, 3).
Show all values of that vector >= 4.

friends <- c(4, 5, 6, 3, 7, 2, 3)
friends[friends >= 4]
[1] 4 5 6 7

which()

The which() function gives the indices of the elements that are TRUE.
It takes a logical vector as an argument.

x <- c(TRUE, FALSE, FALSE, TRUE)
which(x)
[1] 1 4

which() can handle missing values:

x <- c(TRUE, FALSE, NA, FALSE, TRUE, NA)
which(x)
[1] 1 5
age <- c(12, 4, 3, 8)
x <- age < 5
x
[1] FALSE  TRUE  TRUE FALSE
which(x)
[1] 2 3
age <- c(12, 4, 3, 8)
x <- age < 5
x
which(x)
age[which(x)]
Index   age   x <- age < 5   which(x)   age[which(x)]
1       12    FALSE
2       4     TRUE           2          4
3       3     TRUE           3          3
4       8     FALSE

Why use which?

age <- c(NA, 12, 4, 3, NA, 8, 7, 4, 3, 6, 4, 3)
x <- age < 6
x
 [1]    NA FALSE  TRUE  TRUE    NA FALSE FALSE  TRUE  TRUE FALSE  TRUE  TRUE
age[x]
[1] NA  4  3 NA  4  3  4  3
mean(age[x])
[1] NA
mean(age[which(x)])
[1] 3.5
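
The reason is the handling of missing values: an NA in a logical index produces an NA in the selection, whereas which() returns only the positions that are TRUE. A brief sketch of this; the na.rm = TRUE alternative at the end is an addition that is not part of the example above:

```r
age <- c(NA, 12, 4, 3, NA, 8, 7, 4, 3, 6, 4, 3)
x <- age < 6

# An NA in the logical index yields an NA in the selection
age[x]

# which() keeps only the positions where x is TRUE, so the NAs are dropped
which(x)
age[which(x)]

# Alternative: keep the NAs in the selection and let mean() ignore them
mean(age[x], na.rm = TRUE)
```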

Task

Create a vector x <- c(1, 4, 5, 3, 4, 5) and identify:
1. Which elements are greater than or equal to three?
2. Create a new vector from x containing all elements that are not four. Note: Use the which() function for this task.

:-)

Task - solution

Create a vector x <- c(1, 4, 5, 3, 4, 5) and identify:
1. Which elements are greater than or equal to three?
2. Create a new vector from x containing all elements that are not four. Note: Use the which() function for this task.

x <- c(1, 4, 5, 3, 4, 5)
which(x >= 3)
[1] 2 3 4 5 6
y <- x[which(x != 4)]
y
[1] 1 5 3 5

Selecting cases with logical vectors

Logical vectors can also be applied to data frames for selecting cases.

Let us take an example data frame:

study <- data.frame(
  sen    = c(0, 1, 0, 1, 0, 1),
  gender = c("M", "M", "F", "M", "F", "F"),
  age    = c(12, 13, 11, 10, 11, 14),
  IQ     = c(90, 85, 90, 87, 99, 89)
)

Select with bracket subsetting or the which() function:

study_no_sen <- study[study[["sen"]] == 0, ]
study_no_sen
sen gender age IQ
1 0 M 12 90
3 0 F 11 90
5 0 F 11 99
# Or using the which() function
filter <- which(study[["sen"]] == 0)
study_no_sen <- study[filter, ]

Task

Calculate the mean of IQ for students with and without sen.

:-)

Task - solution

Calculate the mean of IQ for students with and without sen.

filter <- which(study[["sen"]] == 0)
mean(study[["IQ"]][filter])
[1] 93
filter <- which(study[["sen"]] == 1)
mean(study[["IQ"]][filter])
[1] 87

Logical Operations

Logical operations are applied to logical values.

Operator   Operation   Example   Result
!          NOT         !x        TRUE when x is FALSE, and FALSE when x is TRUE
&          AND         x & y     TRUE when both x and y are TRUE, else FALSE
|          OR          x | y     TRUE when x or y (or both) is TRUE, else FALSE

Note: To get the | sign:
On a German Mac keyboard press: option + 7
On a German Windows keyboard press: AltGr + <

Example

x <- TRUE
y <- FALSE


!x
[1] FALSE
!y
[1] TRUE
x & y
[1] FALSE
x | y
[1] TRUE

Logical operators with vectors

When applied to vectors, logical operations result in a new vector.
Operations are applied to each element one by one.

x <- c(TRUE, FALSE, TRUE,  FALSE)
y <- c(TRUE, FALSE, FALSE, TRUE)
!x
[1] FALSE  TRUE FALSE  TRUE
x & y
[1]  TRUE FALSE FALSE FALSE
x | y
[1]  TRUE FALSE  TRUE  TRUE

Task

Create two vectors:

glasses <- c(TRUE, TRUE, FALSE, TRUE, FALSE)  
hyperintelligent <- c(TRUE, FALSE, FALSE, TRUE, FALSE)

Determine for each element whether glasses and hyperintelligent are TRUE at the same time.

:-)

Task - solutions

Create two vectors:

glasses <- c(TRUE, TRUE, FALSE, TRUE, FALSE)  
hyperintelligent <- c(TRUE, FALSE, FALSE, TRUE, FALSE)

Determine for each element whether glasses and hyperintelligent are TRUE at the same time.

glasses <- c(TRUE, TRUE, FALSE, TRUE, FALSE)
hyperintelligent <- c(TRUE, FALSE, FALSE, TRUE, FALSE)
glasses & hyperintelligent
[1]  TRUE FALSE FALSE  TRUE FALSE
glasses   hyperintelligent   glasses & hyperintelligent
TRUE      TRUE               TRUE
TRUE      FALSE              FALSE
FALSE     FALSE              FALSE
TRUE      TRUE               TRUE
FALSE     FALSE              FALSE

sum() and mean() with logical vectors:

When a numeric function (e.g. mean() or sum()) is applied to a logical vector, TRUE is counted as 1 and FALSE as 0:

sum() then gives the number of elements that are TRUE.
mean() gives the proportion of elements that are TRUE.

# e.g.:
sum(c(TRUE, FALSE, TRUE))
[1] 2
mean(c(TRUE, FALSE, TRUE, FALSE))
[1] 0.5

Task

Take the data from the last example and calculate the sum and proportion of cases that wear glasses and are hyperintelligent.

glasses <- c(TRUE, TRUE, FALSE, TRUE, FALSE)
hyperintelligent <- c(TRUE, FALSE, FALSE, TRUE, FALSE)

:-)

Task - solutions

Take the data from the last example and calculate the sum and proportion of cases that wear glasses and are hyperintelligent.

sum(glasses & hyperintelligent)
[1] 2
mean(glasses & hyperintelligent)
[1] 0.4

Combining logical and relational operators

age <- c(12, 4, 3, 8, 4, 2, 1, 7, 4)
gender <- c(0, 1, 0, 1, 0, 0, 0, 0, 1)
age > 4
[1]  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE
gender == 0
[1]  TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE
age > 4 & gender == 0
[1]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
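
The combined logical vector can be used directly for subsetting and counting. A short sketch using the two vectors from above:

```r
age <- c(12, 4, 3, 8, 4, 2, 1, 7, 4)
gender <- c(0, 1, 0, 1, 0, 0, 0, 0, 1)

# Ages of all cases that are older than 4 and have gender == 0
age[age > 4 & gender == 0]

# Number of cases that fulfil both conditions
sum(age > 4 & gender == 0)
```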

Task

Create a vector
income <- c(5000, 4000, 3000, 2000, 1000) and a vector
happiness <- c(20, 35, 30, 10, 50).

  1. Use relational and logical operations to determine for each element whether the income is larger than 2500 and at the same time happiness is above 25.

  2. Calculate the proportion.

:-)

Task - solution

  1. Use relational and logical operations to determine for each element whether the income is larger than 2500 and at the same time happiness is above 25.

  2. Calculate the proportion.

income <- c(5000, 4000, 3000, 2000, 1000)
happiness <- c(20, 35, 30, 10, 50)
income > 2500 & happiness > 25
income   happiness   income > 2500   happiness > 25   income > 2500 & happiness > 25
5000     20          TRUE            FALSE            FALSE
4000     35          TRUE            TRUE             TRUE
3000     30          TRUE            TRUE             TRUE
2000     10          FALSE           FALSE            FALSE
1000     50          FALSE           TRUE             FALSE

… and the proportion

mean(income > 2500 & happiness > 25)
[1] 0.4

Subsetting data frames with logical and relational operators

study
sen gender age IQ
0 M 12 90
1 M 13 85
0 F 11 90
1 M 10 87
0 F 11 99
1 F 14 89
filter <- study[["sen"]] == 1 & study[["gender"]] == "M"
study[filter, ]
sen gender age IQ
2 1 M 13 85
4 1 M 10 87
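
More than two conditions can be chained in the same way, for example to select male students within an age range. In the sketch below, the range 10 to 12 is an assumption chosen only so that the example data contain matching cases:

```r
study <- data.frame(
  sen    = c(0, 1, 0, 1, 0, 1),
  gender = c("M", "M", "F", "M", "F", "F"),
  age    = c(12, 13, 11, 10, 11, 14),
  IQ     = c(90, 85, 90, 87, 99, 89)
)

# Male students aged 10 to 12 (both limits included)
filter <- study[["gender"]] == "M" &
  study[["age"]] >= 10 &
  study[["age"]] <= 12

study[filter, ]
```

With these data, the selection keeps cases 1 and 4.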

Task

Use the ChickWeight data frame for the following task.
The data set is already included in R.

  1. Look into the data set with ?ChickWeight.
  2. Get all variable names of the data frame with the names() function (names(ChickWeight)).
  3. Select cases from ChickWeight with Diet == 1 and Time < 16.
  4. For these cases, calculate the correlation between weight and Time. Note: Use the cor() function (e.g., cor(x, y))
  5. Repeat steps 3 and 4 for Diet == 4.
  6. What can you see?

:-)

filter <- ChickWeight[["Diet"]] ==  1 & ChickWeight[["Time"]] < 16
diet1 <- ChickWeight[filter,]
cor(diet1[["weight"]], diet1[["Time"]])
[1] 0.8109772


filter <- ChickWeight[["Diet"]] ==  4 & ChickWeight[["Time"]] < 16
diet4 <- ChickWeight[filter,]
cor(diet4[["weight"]], diet4[["Time"]])
[1] 0.9720822

The correlation is larger for Diet 4. This suggests that, under Diet 4, the chicks' weight increases more consistently with time.

The subset() function

R comes with a function to make subsetting a bit more straightforward.

subset() has the main arguments:

  • x : A data.frame
  • subset : A logical expression indicating the rows to keep
  • select : An expression indicating the columns to select from the data frame

and returns a data.frame.

subset(study, gender == "F" & IQ > 89, c(sen, gender, IQ))
sen gender IQ
3 0 F 90
5 0 F 99

Variable names must be provided without quotes and without the name of the data.frame.
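
The select argument also accepts a negative name to drop a column instead of keeping it. A minimal sketch with the study data frame (re-created here so that the block runs on its own):

```r
study <- data.frame(
  sen    = c(0, 1, 0, 1, 0, 1),
  gender = c("M", "M", "F", "M", "F", "F"),
  age    = c(12, 13, 11, 10, 11, 14),
  IQ     = c(90, 85, 90, 87, 99, 89)
)

# Keep the students with sen == 1 and drop the IQ column
study_sen <- subset(study, sen == 1, select = -IQ)
study_sen
```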

Task

Take the mtcars dataset and filter cases (here: car models) with 6 cylinders (variable cyl) and manual transmission (value 1 in variable am).
Select the variables mpg, am, gear, cyl.
Use the subset function.

:-)

Task - solutions

Take the mtcars dataset and filter cases (here: car models) with 6 cylinders (variable cyl) and manual transmission (value 1 in variable am).
Select the variables mpg, am, gear, cyl.
Use the subset function.

subset(mtcars, cyl == 6 & am == 1, c(mpg, am, gear, cyl))
               mpg am gear cyl
Mazda RX4     21.0  1    4   6
Mazda RX4 Wag 21.0  1    4   6
Ferrari Dino  19.7  1    5   6

So many ways of subsetting … an overview

Subset a data frame (and get a new data frame)

mtcars[mtcars[["cyl"]] == 6 & mtcars[["am"]] == 1, 
       c("mpg", "am", "gear", "cyl")]

mtcars[mtcars$cyl == 6 & mtcars$am == 1, c("mpg", "am", "gear", "cyl")]

subset(mtcars, cyl == 6 & am == 1, c(mpg, am, gear, cyl))

with(mtcars, 
  mtcars[cyl == 6 & am == 1, c("mpg", "am", "gear", "cyl")]
)

So many ways of subsetting … an overview

Extract a variable from a data frame (and get a numeric or character vector)

mtcars[["mpg"]][mtcars[["cyl"]] == 6 & mtcars[["am"]] == 1]

mtcars$mpg[mtcars$cyl == 6 & mtcars$am == 1]

subset(mtcars, cyl == 6 & am == 1, mpg, drop = TRUE)

with(mtcars, mpg[cyl == 6 & am == 1])
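
That the four variants are interchangeable can be checked with identical(). A short sketch; the names v1 to v4 are only used here for illustration:

```r
v1 <- mtcars[["mpg"]][mtcars[["cyl"]] == 6 & mtcars[["am"]] == 1]
v2 <- mtcars$mpg[mtcars$cyl == 6 & mtcars$am == 1]
v3 <- subset(mtcars, cyl == 6 & am == 1, mpg, drop = TRUE)
v4 <- with(mtcars, mpg[cyl == 6 & am == 1])

# Each variant returns a plain numeric vector
v1

# Compare the variants pairwise
identical(v1, v2)
identical(v1, v3)
identical(v1, v4)
```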

Odd behaviour:

For base R data frames this creates a vector:

mtcars[mtcars[["cyl"]] == 6 & mtcars[["am"]] == 1, "mpg"]
[1] 21.0 21.0 19.7

One might expect a data frame with one variable here, but the result is automatically simplified to a vector.
Add drop = FALSE to keep the data frame.

mtcars[mtcars[["cyl"]] == 6 & mtcars[["am"]] == 1, "mpg", drop = FALSE]
               mpg
Mazda RX4     21.0
Mazda RX4 Wag 21.0
Ferrari Dino  19.7

Some modern implementations of data frames (like tibbles) changed this behavior: subsetting a tibble with single brackets always returns a tibble.